Automatic Parallel Program Checkpointing in Message-Passing Environments
Abstract
Efficient use of cluster resources is very important because of the high demand for parallel computation. Checkpointing allows cluster computing time to be managed more efficiently. This article discusses the problems of checkpointing parallel programs and presents an implementation of an automatic parallel checkpointing system for MPI programs. The system is based on a simple, portable user-space checkpointing library and uses two different parallel program analysis approaches to ensure checkpoint consistency.
Similar resources
A Heuristic Approach for the Automatic Insertion of Checkpoints in Message-Passing Codes
Checkpointing tools may be typically implemented at two different abstraction levels: at the system level or at the application level. The latter has become a more popular alternative due to its flexibility and the possibility of operating in different environments. However, application-level checkpointing tools often require the user to manually insert checkpoints in order to ensure that certa...
A Parallel Implementation of the Everglades Landscape Fire Model in Networks of Workstations
This paper presents a data-parallel implementation of the Everglades Landscape Fire Model (ELFM) with low communication overhead and high performance in a network of workstations (NOWs). Checkpointing and rollback techniques were used to handle the spread of fire, which is a dynamic and irregular component of the model. A synchronous checkpointing mechanism was used in the parallel ELFM code using ...
Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI Protocols
A long-term trend in high-performance computing is the increasing number of nodes in parallel computing platforms, which entails a higher failure probability. Fault tolerant programming environments should be used to guarantee the safe execution of critical applications. Research in fault tolerant MPIs has led to the development of several fault tolerant MPI environments. Different approaches a...
Automatic Differentiation for Message-Passing Parallel Programs
Many applications require the derivatives of functions defined by computer programs. Automatic differentiation (AD) is a means of developing code to compute the derivatives of complicated functions accurately and efficiently, without the difficulties associated with developing correct code by hand. We discuss some of the issues involved in developing automatic differentiation tools for parallel...
Towards Data Persistency for Fault-tolerance Using MPI Semantics
As the size and complexity of high-performance computing hardware, as well as applications increase, the likelihood of a hardware failure during the execution time of large distributed applications is no longer negligible. On the other hand, frequent checkpointing of full application state or even full compute node memory is prohibitively expensive. Thus, application-level checkpointing of only...
Publication date: 2007